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Abstract — Significant volumes of knowledge have been ac- 
cumulated in recent years linking subtle genetic variations to 
a wide variety of medical disorders from Cystic Fibrosis to 
mental retardation. Nevertheless, there are still great challenges 
in applying this knowledge routinely in the clinic, largely due to 
the relatively tedious and expensive process of DNA sequencing. 
Since the genetic polymorphisms that underlie these disorders 
are relatively rare in the human population, the presence or 
absence of a disease-linked polymorphism can be thought of 
as a sparse signal. Using methods and ideas from compressed 
sensing and group testing, we have developed a cost-effective 
genotyping protocol. In particular, we have adapted our scheme 
to a recently developed class of high throughput DNA sequencing 
technologies, and assembled a mathematical framework that 
has some important distinctions from 'traditional' compressed 
sensing ideas in order to address different biological and technical 
constraints. 



I. INTRODUCTION 

Genotyping, the process of determining the genetic variation 
of a certain trait in an individual, has become a pivotal compo- 
nent of medical genetics, as a broad spectrum of disorders are 
now known to be induced by non-functional genes. In the past 
thirty years, extensive efforts were made to identify and locate 
risk alleles of severe genetic diseases, which are characterized 
by incapacities or lethality of the affected individuals at an 
early age, and very low prevalence in the population. These 
efforts have not only led to deeper insights regarding the 
molecular mechanisms that underlie those genetic disorders, 
but have also contributed to the emergence of large scale 
genetic screens, where individuals are genotyped for a panel 
of risk alleles in order to detect genetic disorders and provide 
early intervention where possible. 

Genetic diseases are broadly classified into two groups 
according to the effect of the underlying mutation - either 
dominant or recessive. This classification is elucidated by 
the diploidy of the human genome, meaning that each gene 
appears in two copies (except the X and Y chromosomes 
in male). Dominant mutation induces a disorder even when 
present only on one chromosome, whereas recessive mutation 
induces the disorder only if both copies are non-functional. 
Thus, for a disease caused by a recessive mutation, individuals 
are classified into three groups : (a) normal if their two 
alleles are intact, (b) carriers if only one allele is functional 
(c) affected if their two alleles are non-functional. Table [l] 
illustrates this classification. A carrier screen is a genetic test 
for detecting individuals that are heterozygous with respect to 
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a risk allele of a severe genetic disease. If two carriers bread, 
they have 25% chance of having an affected offspring for 
each carriage, and in some countries that face high prevalence 
of severe genetic diseases the practice is to offer a screen 
to the entire population, regardless their familial history for 
early monitoring and prevention [1], [2]. Therefore, due to the 
importance and ubiquity in medical gentics and the intriguing 
connection to the theory of sparse signal recovery, our work 
is focused on carrier screens. However, large parts of the 
framework can be used also for other types of genetic screens. 

The most common genotyping method is based on sequenc- 
ing the regions that encompass the risk genes and analyzing 
the type of the DNA sequence - whether it matches the wild 
type (WT) alleles or a known risk allele. This approach gained 
popularity due to its high accuracy (sensitivity and specificy), 
applicability to a wide variety of genetic disorders, and tech- 
nical simplicity. However, the current DNA sequencing tech- 
nologies used for genotyping provide only serial processing 
of one specimen/region combination at a time. This increases 
the cost of labor and other expenses in large-scale screens and 
essentially deters individual participation and Umits the panel 
of genes that are analyzed. 

Recently, a new class of DNA sequencer methods, 
dubbed next-generation sequencing technologies (NGST) has 
emerged, revolutionizing molecular biology and genomics 
(reviewed in [3]-[5]). These sequencers process the DNA 
fragments in parallel and provide millions of sequence reads 
in a single batch, each of which corresponds to a DNA 
molecule within the sample. While there are several types of 
NGST platforms and different sets of sequencing reactions, 
all platforms achieve parallelization using a common concept 
of immobilizing the DNA fragments to a surface, so that 
each fragment occupies a distinct spatial position. When the 
sequencing regants are applied to the surface, they generate 
optical signals dependent on the DNA sequence, which are 
then captured by a microscope and processed. Since the 
fragments are immobilized, successive signals from the same 
spatial location convey the DNA sequence of the correspond- 
ing fragment (Fig. [T^). However, the spatial locations of the 
DNA fragments are completely random and are based on 
stochastic hybridization of a small aliquot (millions) of DNA 
fragments out of significantly larger number present in a 
sample. Therefore, if a DNA library is composed of multiple 
specimens, it is not possible to associate the sequence reads 
with their corresponding specimens. This limitation is the main 
obstacle to the utilization of next generation sequencers in 
large scale screens, since dedicating a run to each specimen 
is not cost effective. 

A simple solution to overcome the specimen-multiplexing 
problem is to append unique identifiers, dubbed DNA bar- 
codes, to each specimen prior to sequencing [6] [7]. These 
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TABLE I 

Genotype-Phenotype Connections in Genetic Disorders 



Alleles 


Genotype 


Dominant 


Recessive 






Disorder 


Disorder 


AA 


Homozygous WT 


Normal 


NoiTnal 


Aa 


Heterozygous 


Affected 


Carrier 


aa 


Homozygous mut. 


Affected 


Affected 



A - normal allele, a - mutant allele 



barcodes are short DNA molecules that are artificially synthe- 
sized, and when attached to the DNA fragment, they label 
it with a unique sequence. The sequencer reads the entire 
fragment, and reports the sequence of the barcode with the 
sequence of the interrogated region. By reading the portion of 
the sequence corresponding to the barcode, the experimenter 
can associate a genotyped fragment to a given specimen (Fig. 
[TJi). While this method was found quite successful for geno- 
typing several dozens of specimens, the synthesis and ligation 
of large number of DNA fragments is both cumbersome and 
expensive. This restricts the scalability of the method for 
genetic screens that consist of thousands of individuals. In fact, 
with the current costs of synthesizing so many DNA barcodes, 
it is more cost beneficial to use the legacy serial-based DNA 
sequencers. 

Drawing inspiration from compressed sensing [8], [9], we 
ask: since only a small fraction of the population are carriers 
of a severe genetic disease, can one employ a compressed 
genotyping protocol to identify those individuals? We suggest 
a protocol in which one genotypes pools of specimens on a 
next-generation sequencing platform that would approach the 
sequencing capacity, while reducing the number of barcodes 
and maintaining a faithful detection of the carriers. 

A. Related work 

Our work closely relates to group testing and compressed 
sensing, which deals with efficient methods for extracting 
sparse information from a small number of aggregated mea- 
surements. Much of the group testing literature (thoroughly 
reviewed in: [10], [11]) is dedicated to the prototypical prob- 
lem, which describes a set of interrogated items that can be 
in an active state or an inactive state and a test procedure, 
which is performed on pools of items, and returns 'inactive' 
if all items in the pool are inactive, or 'active' if at least one 
of the items in the pool is active. Mathematically, this type 
of test can be thought of as an OR operation over the items 
in the pool, and is called superimposition [12]. In general, 
there are two types of test schedules: adaptive schedules, in 
which items are analyzed in successive rounds and re-pooled 
from round to round according to the accumulated results, and 
non-adaptive schedules where the items are pooled and tested 
in a single round. While in theory adaptive schedules require 
less tests, in practice they are more labor intensive and time 
consuming due to the re-pooling steps and the need to wait for 
the test results from the previous round. For that reason, non- 
adaptive schedules are favored, and have been employed for 
several biological applications including finding sequencing 



tagged sites in yeast artificial chromosomes (YAC) [13] , and 
mapping protein interactions [14]. 

Compressed sensing [8], [9] is a recently emerged signal 
processing technique that describes conditions and efficient 
methods for capturing signals that are sparse in some or- 
thonormal representation by measuring a small number of 
linear projections. This theory extends the framework of group 
testing to the recovery of hidden variables that are real (or 
complex) numbers. Additonal deviation from group testing is 
that the aggregated measurements reports the linear combina- 
tion of the data points and not superimposition. However, some 
combinatorial concepts in group testing were found useful 
also for compressed sensing, and it has been recently shown 
that deterministic designs based on group testing can confer 
sublinear reconstruction runtimes for real signals [15]-[17]. 
The framework of compressed sensing was also found useful 
for applications beyond 'traditional' signal processing and re- 
cently a novel biological application was suggested - designing 
highly efficient microarrays that reduces the number of DNA 
probes, which is a factor hampering their miniaturization [18], 
[19]. 

Our approach combines lessons from both fields but also 
has key differences from these frameworks. The most obvious 
deviation of our is that rather than focussing solely on maximal 
reduction in the number of queries (termed 'measurements' in 
compressed sensing, or 'tests' in group testing) additional cost 
functions are introduced. Principally, we are interested in an 
additional objective of minimizing the weight of the design, 
corresponding to the number of nonzero elements of the design 
matrix (for a fixed number of specimens). This constraint 
originates from the properties of next generation sequencers, 
and prevents maximal query reduction. We will discuss the 
consequences of this constraint and provide some theoretical 
bounds and efficient designs. In particular, this theoretical 
framework is built on our recent experimental results regarding 
genotyping thousands of bacterial colonies using combinato- 
rial pooling with NGSTs for a biotechnological application 
[20]. Prabhu et al. [21] develop a closely related theoretical 
approach to detect singletons using error correcting codes. 
A somewhat similar compressed sensing approach has been 
independently developed by Shental et al. [22]. 

The manuscript is divided as follows: In section [ill we set 
up the basic formulation of compressed genotyping. In section 
III we present the concept of light-weight designs and provide 
a lower theoretical bound. Then, we show how constructions 
based on the Chinese Reminder Theorem comes close to this 
bound. In section |IV] we present a Bayesian reconstruction 
approach based on belief propagation, and in section [V] we 
provide several simulations of carrier test, including Cystic 



Fibrosis. Section VI concludes the manuscript. 



II. THE GENOTYPING PROBLEM - PRELIMINARIES 

A. Notations 

We denote matices as an upper-case bold letter and the (z, j) 
element of the matrix X as Xij. The shorthand X denotes a 
matrix that its row vectors are normalized and to sum to 1. 
I(X) is an indicator function that returns a matrix in the same 
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Fig. 1. Common Techniques in Next Generation Sequencing (a) Higii througliput sequencing is employed by immobilizing DNA fragments (rods) on a slide, 
and using sequencing reagents (black ovals) that generate optical signals, (b) DNA barcoding is based on synthesizing short DNA sequences, in the example 
'AAA' and 'GGG', and ligating them to the samples in distinct reactions. When the samples are mixed the barcodes maintain the identities of the specimens. 



size as X with: 



For example: 



1 Xij > 
Xij = 



The genotype of n individuals is represented by an n x s 
matrix G, called the genotype matrix, that is composed of the 
genotype vectors as its rows; Gij denotes how many copies 
of the j-allele the i-individual holds. For example, consider 
the follwing genotyping matrix with 6 individuals and only 2 
alleles: 



1 
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2 3 
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G = 


2 











0.4 


0.6 
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1 


"o l' 






2 





1 1 
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2 _ 



I(X) 



The operation | • | denotes the cardinality of a set or the length 
of a vector. For graphs, da refers to the subset of nodes that 
are connected to the node a, and the notation da\b means the 
subset of nodes that are connected to a except of node b. We 
use natural logarithms. 

B. Genotyping As Bipartite Graph Reconstruction 

Consider a single human specimen that is being genotyped 
for a gene that has two alleles, labled by A and a. We 
represent the genotype of this specimen by a vector of length 
two with three possible outcomes: (2,0) if the specimen 
is homozygous for the A allele, (1,1) if the specimen is 
heterozygous, and (0, 2) if the specimen is homozygous for the 
a allele. This representation can also accommodate situations 
where the gene has more than two alleles in the population by 
increasing the length of the vector to the number of the alleles. 



In that case, the 4*'* individual is a carrier, the 6*'' is affected, 
and the others are normal. 

The genotyping matrix can be represented by a bipartite 
graph. Let G be a bipartite multigraph, G — {X, S, E,p) that 
is built according to the genotype matrix, where X is a set of 
n specimens, 5 is a set of s possible alleles in the population, 
and the edge set, E, denotes which subset of alleles each 
specimen holds. The degree of the specimen nodes, deg(a;i), is 
a constant denoted by p, which represents the genome ploidy[^ 
and in human p ~ 2. The degree of each allele node, deg(si), 
is a random variable that is dictated by the prevalence of the 
genotypes in the population. According to that representation, 
genotyping is in general the task of reconstructing the bipartite 

' this assumption is valid for the vast majority of the genotyping problems, 
but some particular cases that involve copy number variation, such as Spinal 
Muscular Atrophy (SMA) [23], do not have a constant degree in their 
specimen nodes. They will remain outside the scope of this manuscript. In 
addition, for the sex chromosomes in male p = 1. 



SUBMITTED TO TRANSACTION ON INFORMATION THEORY - SPECIAL ISSUE ON MOLECULAR BIOLOGY AND NEUROSCIENCE 



5 




Fig. 2. Cystic Fibrosis A_F508 Screen as a Bipartite Multigrapli Reconstruc- 
tion. There are two allele nodes, the WT and the AF508 mutation. Samples 
1, 2, 3, 5 are WT, while specimen 4 is a canier. The specimen labeled with 
'X' is affected and does not enter to the screen. i?,-isfc is the edge between 
specimen 4 and the 'Mut' node. 



graph from the sequencing information - finding the edge set 
E where X and S are known, subject to p = 2. 

In a carrier screen, one is mainly interested in reconstructing 
a part of the graph, Erisk, that represents the subset of 
individuals that are carriers of the recessive risk alleles. The 
graph is very sparse due to the low prevalence of the risk 
alleles. Moreover, a large portion of severe genetic disease 
exhibit complete penetrance [24], meaning that the affected 
individuals are symptomatic, and therefore are known, and do 
not participate in the screen. Thus, finding that one edge of a 
given individual is connected to a risk allele node immediately 
implies that the other edge is connected to a non-risk allele 
node, which further reduces the degrees of freedom in the 
graph reconstruction. 

Consider two examples of carrier screens. First, consider 
a screen for AF508, the most prevalent mutation in Cystic 
Fibrosis (CF) among people of European descent (Fig. |2]i. In 
that case, the set S has two members: WT and mutant, and the 
expectation of the ratio between Acg{smutant) to deg(svKT) 
is around 1:29 for screens in European [25] [26]. Most of 
specimen nodes sends double edges to the WT node, which 
implies that these specimens carry two copies of the normal 
allele. A small portion of specimen nodes are connected to the 
two different allele nodes, meaning that these specimens are 
carriers for CF. There are no specimens that are connected by 
double edges to the mutant allele, since this mutation always 
leads to CF, and the affected individuals do not participate in 
the screen. Consider also a screen with multiple risk alleles, 
as in the case of FMRl gene that causes Fragile X mental 
retardation [27]. This gene has dozens of alleles, but only a 
small subset causes the disease. Therefore, we need only to 
resolve edges to the risk alleles. However, we are intrested in 
more than a binary classification of the specimens to carriers 
and normals, as the causative alleles carry different degrees of 
disease risk (technically known as penetrance), and identifying 
the exact allele vector has clinical utility. 



C. Defining a Cost-effective Reconstruction 

Following the analysis above, a genotyping assay is a 
query of the form: "which allele nodes are connected to the 
interrogated specimen nodes?". The current DNA sequencing 
methods that rely on serial specimen processing perform this 
query for each individual separately. However, the sparsity and 
the restricted structure of G suggest that Erisk may be found 
in a relatively small number of queries when performing the 
queries on pools of specimens. Fortunately, the sequencing 
capacity of next generation sequencers enables the querying 
of pools of large numbers of specimens. 

We will refer to the task of reconstructing G in the most cost 
effective way as the minimal genotyping problem. Note that 
this task is intentionally not defined as minimizing the number 
of queries, since there are additional factors that determine the 
cost and the feasibility of the procedure. 

We envision a minimal genotyping strategy that is based on 
a non-adaptive query schedule in order minimize the turnover 
time and the need to re-pool the specimens multiple times. Our 
strategy starts by pooling samples of the specimens according 
to a certain design as denoted by which is a t x n 
binary matrix; the columns of $ represent specimens, and 
each row determines a pool of specimens to be queried. For 
example, if the first row of is (1,0,1,0,1...), it specifies 
that the P*, S'^'^, 5*'\ ... specimens are pooled and queried 
(sequenced) together. Since the pooling is carried out using 
a liquid handling robot that can take several specimens in 
every batch, we only consider designs in which all specimens 
are sampled the same number of times to reduce the robotic 
logistics. We define the weight w of $ to be the number of 
times a specimen is sampled, or the number of 1 entries in a 
given column vector: every column is constrained to have the 
same number of I's in this design. Let be the compression 
level of the ith query, namely the number of 1 entries in the 
j-ow, which denotes the number of specimens in the i*'' 
pool. 

In large scale carrier screens, n is typically between few 
thousands to tens of thousands of specimens. We restrict 
ourself to query designs with r„iax 1000 specimens, due 
to technical / biological limitations (in DNA extraction and 
PCR amplification) when processing pools with larger number 
of specimens. A single query in such designs, even when 
composed of 1000 specimens, does not saturate the sequencing 
capacity of next generation platforms. In order to fully exploit 
the capacity, we will pool queries together into query groups 
until the size of each group reaches the sequencing capacity 
limit, and we will sequence those groups in distinct reactions. 
Before pooling the queries, we will label each query with a 
unique DNA barcode in order to retain its identity (for an in- 
depth protocol of this approach see [20]). Thus, the number 
of queries, t, is proportional to the number of DNA barcodes 
that should be synthesized, and one objective of the query 
design, similar to those found in group testing and compressed 
sensing, is to minimize t. 

In practice, once the DNA barcodes are synthesized, there 
is enough material for a few dozens experiments, and one can 
re-use the same barcode reagents for every query group as 
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TABLE II 

Summary of Query Design Parameters 



Notation 


Meaning 


Typical Values 


Comments 




Query design 






n 


Number of specimens 


Thousands 




t 


Number of queries (pools) 






w 


Weight 


<8 


Number of times a specimen is sampled 


^max 


Max. level of compression 


< 1000 


Maximal number of specimens in a pool 




Number of queries in the largest query group 


Up to a few hundreds 


Corresponds to barcode synthesis reactions for a single 
experiment 



these are sequenced in distinct reactions. Hence, the number 
of queries in the largest pooling group, Tmax, dictates the 
synthesis cost for a small series of experiments. While this 
does not change the asymptotic cost (e.g. for synthesizing 
barcodes for a large series of experiments), it has some 
practical implications, and we will include it in our analysis. 

The overall sequencing capacity needed for genotyping is 
proportional to '^i' '^^e total sum of the sizes of the 

queries. By definition nw = X]*=i''i- Thus, the weight 
w determines the requisite sequencing capacity for a given 
number of specimens, and it is an additional factor that should 
be minimized in order to achieve a cost effective design. 
Moreover, decreasing the weight also reduces the number of 
times a specimen is sampled, and consequently, the robot 
time, and the amount of material that is consumed. Notice 
that minimizing the weight of the design is not required by 
traditional compressed sensing construction. 

Next generation sequencers are usually composed of several 
distinct biochemical chambers, called 'lanes', that can be 
processed in a single batch. We assume that the sequencing 
capacity needed for n specimens corresponds to one lana^ 
Since in total nw aliquots of specimens are sampled in the 
pooling step, one needs w lanes to sequence the entire design, 
where each lane is loaded with a different query group. 

We do not intend to specify a global cost function that 
includes the costs of barcode synthesis, robotic time, sequenc- 
ing lanes, and other reagents. Clearly, these costs vary with 
different genotyping strategies, sequencing technologies, and 
so on. Rather, we will present heuristic rules that would 
be applicable in most situations. First, w should not exceed 
the maximal number of lanes that can be processed in a 
single sequencing batch, as launching a run is expensive and 
time consuming, and currently, for the most widespread next- 
generation sequencing platform, w < 8 [29]. Therefore, it is 
also desirable that a design construction will have an explicit 
mechanism to specify the target weight. Second, we assume 
that the cost of adding a sequencing lane is about two to 
three orders more than the cost of synthesizing an additional 
barcode. This will mainly served to benchmark the results of 
our design to the outcome of other designs that were studied 
in group testing. Table |ll] presents the notations we used in 

- recent data have shown that when the number of specimens is a few 
thousand up to tens of thousands this assumption is valid [28] 



that part. 

To conclude this part, the query design for the minimal 
genotyping problem is to find a t x 71 design matrix composed 
of O's and I's 4> that that provides sufficient information to 
reconstruct G, while minimizing t and keeping the column 
sum or weight w below a certain threshold. We term a design 
that addresses these objectives as light weight design. 

D. The Compositional Channel 

The sequencing procedure starts by capturing random DNA 
molecules from the input material, and therefore, the ratios 
of sequence types reflect the corresponding ratios of the 
genotypes in the input material. For example, consider a 
sample that is composed of a mixture of two specimens in 
equal ratio, where one specimen is homozygous WT and the 
other is heterozygous. About 3/4 of the sequence reads will 
correspond to the WT allele and 1/4 to the mutant genotype. 
Since the input material in our case is composed of pools of 
specimens, the sequencing results are given by the following 
conditional probability: 

//3(Y I *G) (1) 

where Y is a t x s matrix that denotes the sequencing results, 
namely the number of sequence reads for each genotype/query 
combination, and G is the n x s biadjacency matrix of the 
genotyping graph. /3 is a sampling parameter, a non-negative 
integer that denotes the number of reads for each query. 
In reality, is a random variable with Poisson distribution, 
since each query has different number of reads. However, 
for simplicity we will treat it as a constant. fpiY \ X) 
denotes a multinomial random process that corresponds to the 
sampling procedure of the sequencers, the joint distribution of 
the sequencing results is therefore given by 

f0{Y I X) = l[a, exp ( -(3 ^ F,, log(l/Z,,) j (2) 

where = jjir^ ^ . As /3 — > 00 the relative ratios of a row 

llj=i 

in Y become similar to the ratios of allele nodes degrees of the 
subgraph induced by the specimens in the pool. We will term 
the process in Eq. ([T]i compositional channel with parameter 
(3. The reason that we used this name is that the channel places 
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s-dimensional real space input vectors in an s — 1 -dimensional 
simplex, which is reminiscent of the concept of compositions 
in data analysis [30]. 

The compositional channel is closely related to two other 
channels, the superimposed channel, and the real adder chan- 
nel. As we mentioned earlier, the superimposed channel has 
been extensively studied in the group testing literature, and 
describes queries that only return the presence or absence of 
the tested feature among the members in the pool. On pooled 
data, measured without noise, the superimposed channel would 
be given by: 

= I(*G) (3) 

The information degradation here is more severe than in the 
case of the compositional channel, since the observer can not 
quantify the number of positive items from a single query with 
positive answer. The output of the compositional channel can 
be further processed as if it was obtained by a superimposed 
channel. In that case denotes only the presence / absence 
of an allele in a query. This degradation is given by: 

Y, = I(Y) (4) 

The real adder channel describes the result of a query as 
a linear combination of the samples in the pool, and is given 
by: 

Y, = *G (5) 

This type of channel serves as the main model for compressed 
sensing, and it captures many physical phenomena. A closely 
related models were studied in group testing for finding 
counterfeit coins with a precise spring scale [31] and in multi- 
access communication [32] [33] [34]. Ideally, when one knows 
the number of specimens in each pool, and /3 — > cx), data from 
the compositional channel can be treated as if it was obtained 
by a real adder channel, since the compositional vectors can 
be placed back in the real space by normalizing Y to Y and 
multiplying the result with the number of specimens in each 
query. 

In reality, the sequencer may also produce errors when read- 
ing the DNA fragments. Since the fragments are composed of a 
barcode region and the interrogated region, sequencing errors 
may lead to association of sequence reads with the wrong 
query, and to an incorrect genotype detection. DNA barcodes 
can be easily extended in order to add more redundancy to the 
codeword that they carry, and experience has demonstrated 
that that errors in barcode annotation are insignificant [35], 
and we will not treat this type of errors. On the other hand, 
sequencing errors in the interrogated region are more involved 
and we will classify them into two categories according to 
their outcome: confounding errors, meaning that a sequence 
read that was derived from one genotype is decoded as another 
valid genotype, and non-sense errors, meaning that a decoded 
sequence read does not correspond to any allele node in G. 
Non-sense errors are easily handled, for instance, by filtering 
those sequence reads, as we assume that all possible alleles are 
known beforehand. Unfortunately, there is no simple remedy 
for confounding errors, and they may distort the observed 
allele ratios in the queries. For instance, consider a pool of 100 
specimens none of which has a mutant allele that is sampled 



TABLE III 

Comparison of Different Channels Models 



Channel model Measurement process Example 

Superimposition OR operation Antibody reactivity 

Compositional Multinomial sampling Next generation sequencing 

Real Adder Additive Spring scale 



with /3 ^ oo. If the confounding error rate is 0.5%, the data 
will falsely indicate that one of the specimens in the pool is 
a carrier The effect of the sequencing errors on genotyping 
may be denoted by a conditional probability which includes a 
confusion matrix: 

fp{Y I *GA) (6) 

A denotes an s x (s + 1) confusion matrix that indicates the 
probability that the z*'* genotype is confounded with the j*-^ 
genotype; the s + 1 column indicates the transition probability 
to a non-sense genotype. We will call the process in Eq. (|6]) 
compositional channel with errors. 

The values of A are dependent on the sequence differences 
between the interrogated alleles and on the specific chemistry 
that is utilized by the sequencing platform. Based on previous 
work regarding the most abundant next generation sequencer 
[28] [36], we presume that subtle mutation differences (known 
as SNPs), such as W12S2X mutation in CF or Y2'i\X 
mutation in Canavan disease, correspond to confounding error 
rates of up to 1%, and when the sequence differences are 
more profound, like in AF508 mutation in CF, we expect 
that the s x s left submatrix in A will resemble the identity 
matrix. Table |lll] summarizes the different channel models in 
this section. 

III. QUERY DESIGN 

A. Constraints On Light-Weight Designs 

Group testing theory suggests a sufficient condition for 
called d-disjunction, that ensures faithful and tractable recon- 
struction of any up to d sparse vector that was obtained from 
a superimposed channel [12]. Since the compositional channel 
can be degraded to a superimposed channel, d-disjunction is 
also a sufficient criterion for reconstructing a d sparse vector 
over a noise free compositional channel. Respecting a carrier 
screen and reconstruction Brisk, if each risk allele node has 
less than d edges, d-disjunction is a sufficient condition to 
reconstruct Brisk given a sufficient sequencing depth and no 
errors. 

Based on the analysis about cost effective designs, we are 
looking for non-trivial d-disjunct matrices that reduce the 
number of barcodes with a minimal increase of the weight. 

Definition 1: 4> is called d-disjunct if and only if the 
Boolean sum of up to d column vectors of the matrix does 
not include any other column vector 

Definition 2: $ is called reasonable if it does not contain 
a row with only a single 1 entry, and its weight is more than 
0. 

We are only interested in reasonable designs. Clearly, if a 
design includes queries composed of single specimens, it is 



SUBMITTED TO TRANSACTION ON INFORMATION THEORY - SPECIAL ISSUE ON MOLECULAR BIOLOGY AND NEUROSCIENCE 



8 



more cost effective to genotype those specimens in serial 
processing. 

Definition 3: Xij is the dot-product of two column vectors 
of 4>, and Amax — max(Aij). 

Lemma 4: The minimal weight of a reasonable d-disjunct 
matrix is: w — d + \. 

Proof: Assume that Xmax occurs between C{i) and C(j), 
two colu mn v ectors in According to definition every 1 
entry in C{i) intersects with at least one column vector Thus, 
there are at most w column vectors that intersect with (7(i|. 
The Boolean sum of those w column vectors includes C(i), 
so the matrix is not w-disjunct. According to definition ([T]i, it 
can not be d-disjunct, and w > d+1. The existence d-disjunct 
matrices with w = d + 1 was proved by Kautz and Singleton 
[12]. ■ 

Definition 5: 4> is called light-weight d-disjunct in case 
w = d+1. 

Lemma 6: $ is a light-weight (w—l) -disjunct iff Xmax — 1 
and 4> is reasonable. 

Proof: First we prove that if 4> is a light-weight (w — 1)- 
disjunct then X^ax — 1- Assume Xr^x occurs between C{i) 
and C{j). According to definition (pb, there is a subset of at 
most w — Xmax + 1 column vectors that C{i) is included in 
their Boolean sum. However, the Boolean sum of any w — 1 
column vectors does not include C{i). Thus, Xmax < 2, and 
according to definition (j2]i, Xmax > 0. Thus, Xmax = 1. In 
the other direction, Kautz and Singelton [12] proved that d — 
[{w — 1)/ Xmax\ , and €> is light-weight according to definition 

©. ■ 

Lemma 7: The number of columns of 4> is bounded by: 



t 



n < 



(7) 



Proof: see Kautz and Singelton [12]. ■ 
Theorem 8: The minimal number of rows, t, in a light- 
weight d-disjunct matrix is t > yjw{w — \)n 
Proof: plug lemma (j6]) to lemma (j7|: 



n < 



n < 





w{w — 1) 
a/ w{w — l)n < t 

■ 

Corollary 9: The minimal number of barcodes is T„iax — 
-^n in a light-weight weight design. 

Proof: There are w query groups, and the bound is 
immediately derived from theorem ([8]l. ■ 
A light weight design is characterized by Xmax = 1. ™d 
t ~ f2(((i + l)\/n) rows. The low Xmax attribute does not 
only increase the disjunction of the matrix but also eliminates 



any short cycle of 4 in the factor graph built upon which 
enhances the convergence of the reconstruction algorithm. We 



will discuss this property in section IV 



B. Light Chinese Design 

We suggest a light weight design construction based on the 
Chinese Remainder Theorem. This construction reduces the 
number of queries to the vicinity of the lower bound derived in 
the previous section, and can be tuned to different weights and 
numbers of specimens. The repetitive structure of the design 
simplifies its translation to robotic instructions, and permits 
easy monitoring. 

Constructing starts by specifying: (a) the number of 
specimens, and (b) the required disjunction, which immedi- 
ately determines the weight. Accordingly, a set of w positive 
integers Q = {qi, . . . ,(7„,}, called query windows, is chosen 
with the following requirement: 



y{qi,qj},i ^ j : lcm{qi,qj) > 



(8) 



where Icm denotes the least common multiplier. We map every 
specimen a; to a residue system (ri, r2, • • • ^r^) according to: 



X = ri (mod qi) 
x = r2 (mod 92) 

X EE Tu, (mod q^) 



(9) 



Then, we create a set of w all-zero sub-matrices 
... called query groups with sizes qi x n. 
The submatrices captures the mapping in Eq. (|9]l by setting 



1 when this clause: x = r (mod qi) is true. Finally, 



we vertically concatenate the submatrices to create $: 

[*2J 



(10) 



For instance, this i^ $ for n — and w = 2, with {qi = 

3,92 = 4}: 



1 
1 
1 





1 
1 



1 

1 

1 





1 





1 



Definition 10: Construction of 4> according to Eq. ( |8] |9] 
[To] ) is called light Chinese design. 

Theorem 11: A light Chinese design is a light weight 
design. 

when X = Qi vje set = qi, so the first row in every submatrix is 1 
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Proof: Let x = Vx (mod qi) and x = Ux (mod qj), 
where i ^ j- According to the Chinese Remainder Theorem 
there is a one-to-one correspondence Vx : x ^ {ux, v^)- Thus, 
every two positive entries in are unique. Consequently, 
\Cx ri Cy\ < 2, and Xmax < 2. Since specimens in the 
form r,r + qi,r + 2qi, . . . are pooled together Xmax ~ 1- 
According to lemma (|6]l 4> is light-weight. ■ 

C. Choosing the Query Windows 

The set of query windows, Q, determines the number of 
rows in 4> as: 



t 



1=1 



(11) 



Since lcm{x, y) — g^j^^ , where gcd is the greatest common 
divisor, minimizing the elements in Q subject to the constraint 
in Eq. ^ implies that qi, . . . ,q^ should be pairwise coprimes 
and qi > y/n. Let k — \y/n] — 1, the definition of the problem 
we seek to solve is as follows: given a threshold, k, and w, a 
vahd solution is a set, R, that contains w co-prime numbers, all 
of which are larger than k. We seek for the optimal solution, 
Q, being the solution satisfying that J2iQ) minimal. 

We begin by introducing a bound on max((5) — k, a value 
we will name 6, or the discrepancy of the optimal solution. 
In order to give an upper bound on 5, let us first consider 
a bound that is not tight, Sq, the discrepancy of the solution 
Qo that is composed of the w smallest primes greater than k. 
Primes near k have a density of l/log(K), so Sq « wlog(K). 
6o is known to be an upper bound on 6 because if any value 
greater than k + Sq appears in the Q, then there is also a prime 
q, K < q < max((3o) that is not used. There is at most one 
value in Q that is not co-prime with q and if it exists it is 
larger than q. Replace it by g in Q (or replace ma,x{Q) with 
q if all numbers in Q are co-prime with q) to reach a better 
solution, contradicting our assumption that Q is the optimal 
solution. 

This upper bound can improved as follows. We know that 
Q C {k, k + 5o]. In this interval, there is at most one value that 
divides any number greater or equal to 6o ■ Consider, therefore, 
the solution Qi composed of the w smallest numbers larger 
than K that have no factors smaller than Sq. In order to assess 
the discrepancy of this solution. Si, note that the density of 
numbers with no factors smaller than So is at least l/log(5o)- 
This can be shown by considering the (lower) density that 
is the density of the numbers with no factors smaller than 
pSf,, where pi indicates the z'th smallest prime. This density 
is given by: 



i<So 



— g^^!<i5o Pi 
^ gZ-^i<5o Pi 

~ p- log(log((5o)) 



(12) 



1 



log((5o) 



where e is Euler's constant and we make use of Y^i^r — ~ 
log(log((5o)), a well-known property of the prime harmonic 
series. Like Sq, the bound Si is also an upper bound on S. 
To show this, consider that the optimal solution Q may have 
z values larger than n + Si in it. If so, there are at least z 
members of Qi absent from it. Replace the z members of Q 
with the absent members of Qi to reach an improved solution. 
We conclude that z — Q and Si > S. 

Theorem 12: For k ^ cxi and large w, S w\og{'w). 
Proof: Consider repeating a similar improvement proce- 
dure as was used to improve from Sq to 5i an arbitrary number 
of times. We define Qi+i as the set of w minimal numbers that 
are greater than k and have no factors smaller than Si, where 
Si is the discrepancy of solution Qi. This creates a series of 
upper bounds for S that is monotone decreasing, and therefore 
converges. Because each Si^i satisfies Si+i w w\og{Si), we 
conclude that the limit will satisfy Soo ~ wlog((5oo), meaning 
S < Soo ^ w \og{'w). This gives an upper bound on S. To prove 
that this bound is tight, we will show that, asymptotically, it 
is not possible to fit w co-prime numbers on an interval of 
size less than w\og{w). To do this, note first that at most one 
number in the set can be even. Fitting w — \ odd numbers 
requires an interval of size at least 2w (up to a constant). The 
remaining numbers can contain at most one value that divides 
by 3. The rest must be either 1 or 2 modulo 3. This indicates 
that they require an interval of at least 2 • More generally, 
if S contains w values, with each of the first w prime numbers 
dividing at most one of said values, then the interval length 
of S must be at least on the order of: 



w 



— = we 



y - 



we 



= wlog{w) 



This gives a lower bound on S equal to the previously 
calculated upper bound, meaning that both bounds are tight. 



Corollary 13: t„ 
Proof: 



fn + wlog(u;) 



max((5) — K = S 

max((5) = \fn + S 
max((3) = \fn + wlog{w) 



Since the number of positive entries in each submatrix is 
the same and equals to n the query groups are formed by 
partitioning 4> to the submatrices. Consequently, Tmax — 
max((5). ■ 
Importantly, the maximal compression level, r^ax, is never 
more than y/n, and the light Chinese design is practical 
for genotyping tens of thousands of specimens. The tight 
bound on S also implies a tight bound on the sum of Q. Let 
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CTQ = Y^^=i 1i ~ We give a tight bound on erg — wk that, 
asymptotically, reaches a 1:1 ratio with the optimal value. 

Theorem 14: The number of queries in the light Chinese 
design is f « Q{'wn + ^w^ log{w)) 

Proof: Proof that this is a lower bound is by induction 
on w. Specifically, let us suppose the claim is true for Qw-i 
and prove for Q^,,. (There is no need to verify the "start" 
of the induction, as any bounded value of w can be said to 
satisfy the approximation up to an additive error.) To prove a 
lower bound, aq^ cannot be better than aQ^_^ + wlog{w), 
as the discrepancy of is known and the partial solution 
Qyj \ ma.x{Qyj) can not be better than To prove that 

this is also an upper bound, consider that the discrepancy of 
Qw is known to be approximately wlog{w), so any prime 
larger than approximately cannot be a factor of more 
than one member of the interval (/t, max(Qw)]. Furthermore, 
the optimal solution for Q^, can not be significantly worse 
than the optimal solution for Q^-i plus the first number that 
is greater than max((5^_i) and has no factors smaller than 
p^. As we have shown before, this number is approximately 
max((5^_i)+log(w). However, we already know the discrep- 
ancy of Qw-i is approximately {w — 1) log(w — 1), so this 
new value is approximately K + {w—l)log{w — l)+log{w) « 
K + wlog{w). Putting everything together, we get that ctq^ < 
o'Q^-i + w log(w) < J2i<w ^ log(i), proving the upper bound. 
The value X]i<to * los(*) is between iw^log(w — 1) and 
\w'^\og{w), so asymptotically aq converges to \w'^\og{w). 

■ 

We will now consider algorithms to actually find Q. First, 
consider an algorithm that begins by setting Tmax to the w 
prime number after k, and then runs an exhaustive search 
through all sets of size w that contain values between k and 
Tmax- This is guaranteed to return the optimal result, and does 
so in complexity 0{{t—k)'^), which is asymptotically equal to 
0((wlog(K))"'). Though this complexity is hyper exponential, 
and so unsuitable for large values of w it may be used for 
smaller w. 

The upper bound described above suggests a polynomial 
algorithm for Q since it is a bound that utilizes sets chosen 
such that none of their elements have prime factors smaller 
than K. This implies the following simplistic algorithm that 
calculates a solution that is asymptotically guaranteed to have 
a 1:1 ratio with the optimal aq. 
1: Let Q be the set of the w smallest primes greater than k. 

2: repeat 

3: (5 <— max((3) — K 

4: (5 <— the w smallest numbers greater than k that have 

no factors smaller than 5 
5: until 5 = max((5) — k 
6: output Q. 

In practice, this is never the optimal solution, as for ex- 
ample, it contains no even numbers. In order to increase 
the probability that we reach the optimal solution (or almost 
the optimal solution), we opt for a greedy version of this 
algorithm. The greedy algorithm begins by producing the set 
of smallest numbers greater than k that have no factors smaller 



than 3 (as in the upper bound). It continues by producing the 
set of smallest co-prime numbers greater than k that have at 
most one distinct factor smaller than 5 (as in the calculation 
of the lower bound). Then, it attempts to add further elements 
with a gradually increasing number of factors. If these attempts 
cause a decrease in 5, it repeats the process with a lower value 
of 5 until reaching stabilization. 

\ : Q ^ initial solution. 

2: repeat 

3: ^ <— max((3) — K 

4: n{x) =^ the number of distinct primes smaller than S 

in the factorization of x. 
5: Sort the numbers k,+1, . . . , max{Q) by increasing n{x) 

[major key] and increasing value [minor key]. 
6: for all i in the sorted list do 

7: if i is co-prime to all members of Q and i < max{Q) 
tlien 

8: replace ma.x{Q) by i in Q. 

9: else if i is co-prime to all members of Q except one, 

q, and i < q then 
10: replace q with i in Q. 

11: end if 
12: end for 
13: until 6 = max{Q) — k 
14: output Q. 

Because this greedy algorithm only improves the solution 
from iteration to iteration, using the output of the first al- 
gorithm described as the initial solution for it guarantees 
that the output will have all asymptotic optimality properties 
proved above. In practice, on the range k = 100 . . . 299 and 
w = 2 ... 8 it gives the exact optimal answer in 91% of the 
cases and an answer that is off by at most 2 in 96% of the 
cases. (Understandably, no answer is off by exactly 1.) The 
worst results for it appear in w = 8, where only 82% of the 
cases were optimal and 88% of the cases were off by at most 
2. 

Notably, due to the fact that k + 1 does not always appear 
in either the optimal solution or the solution returned by the 
greedy algorithm, sub-optimal results tend to appear in streaks: 
a sub-optimal result on a particular k value increases the 
probability of a sub-optimal result on k-I-I (A similar property 
also appears when increasing w), and we denote an interval 
of consecutive k values where the greedy algorithm returns 
sub-optimal results to be a "streak". The number of streaks 
is, perhaps, a better indication for the quality of the algorithm 
than the total number of errors. For the parameter range tested 
(totaling 1400 cases), the greedy algorithm produced 61 sub- 
optimal streaks (of which in only 23 streaks the divergence 
from the optimal was by more than 2). The worst w was 6, 
measuring 14 streaks. The worst-case for divergence by more 
than 2 was w = 8, with 8 streaks. 

In terms of the time complexity of this solution, this can be 
bounded as follows. First, we assvime that the values in the 
relevant range have been factored in advance, so this does 
not contribute to the running time of the algorithm. (This 
factorization is independent of k and w, except in the very 
weak sense that k and w determine what the "relevant" range 
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to factor is.) Next, we note that the initial S is determined 
by searching for w primes, so we begin with a S value on 
the order of wln{K). Each iteration decreases S, so there are 
at most wlog{K) iterations. In each iteration, the majority of 
time is spent on sorting S numbers. Hence, the running time of 
the algorithm is bounded by log{6) or log^((5)(log(w) + 
log(log(K))). Clearly, this is a polynomial solution. In practice, 
it converges in only a few iterations, not requiring the full 
6 potential iterations. In fact, in the tested parameter range 
the algorithm never required more than three iterations in any 
loop, and usually less. (Two iterations in the greedy allocation 
loop is the minimum possible, and an extra iteration over that 
was required in only 4% of the cases.) 

In some cases, it is beneficial to increase the number of 
barcodes from k to ki in order to achieve higher probability 
of faithful reconstruction of signals that are not d sparse. This 
is achieved by finding w integers in the interval {k, ki) that 
follow Eq. ^ and maximize nl^i^j- We give more details 
about this problem in appendix |l] 

Lastly, we note that even though these approximation al- 
gorithms are necessary for large values of w, for small w 
an exhaustive search for the exact optimal solution is not 
prohibitive, even though the complexity of such a solution is 
exponential. One can denote the solution as Q ~ {k + si, k + 
S2, ■ ■ ■ , K+Suj} in which case the values {si, . . . , s^} are only 
dependent on the value of k modulo primes that are smaller 
than the maximal 6 or approximately wlog{w). This means 
that the {si, . . . , s^} values in the optimal solution for any 
{k, w) pair is equal to their values for (k mod P, w), where 
P is the product of all primes smaller than S. Essentially, there 
are only P potential values of k that need to be considered. 
All others are equivalent to them. 

In practice, the number of different k values that need to be 
considered is significantly smaller than this. As an example. 
Fig. |3] gives the complete optimal solution for any value of k 
with w — i. The figure shows that the set S = {si, . . . , s^} 
for any k has only 7 possible values, and that determining 
which set produces the optimal solution for any particular 
value of K can be done by at most 5 Boolean queries regarding 
the value of k modulo specific primes. 




Fig. 3. Optimal Solution using Tree Search 
TABLE IV 

Comparison between Eppstein's Method to the Light Chinese 
Design 



n 


d 


Eppstein 


Li 


^ht Chinese design 


t 


w 






t 


to 


'^max 


"^max 




3 


149 


lot 


29 


1000 


293 


4 


11 


64 


5000 


4 


237 


12t 


37 


714 


370 


5 


11 


64 




5 


336 


15t 


47 


1000 


449 


6 


79 


64 




3 


209 


12t 


37 


sooot 


811 


4 


205 


199 


40000 


4 


335 


14t 


47 


5714t 


1020 


5 


209 


199 




5 


472 


17t 


59 


5714t 


1231 


6 


211 


199 



to create a d-disjunct matrix. The number of queries in their 
method for a given d is: 



t 0(d^log^ n/(logd + loglog7i)) 



(13) 



and the weight is: 



D. Comparison to Logarithmic Designs 

It is well established in group testing theory and in com- 
pressed sensing that certain designs can reach to the vicinity 
of the lower theoretical bound of t ^ 0{dlogn) [37], [38]. 
The t ~ 0{{d + l)^/n) scale in light weight designs raises 
the question whether they are really the most cost effective 
solution for the minimal genotyping problem with thousands 
of specimens and w ^ 8. 

We compared the results of the light Chinese design with 
the method of Eppstein and colleagues [39] for screens with 



w ^ 0{dlogn/ {log d + loglogn)) 



(14) 



5000 and 40000 specimens (Table |IV|). To the best of our 
knowledge, Eppstein's method shows for the general case the 
maximal reduction of t. Interestingly, it is also based on the 
Chinese Remainder Theorem, but without the assertion in Eq. 
(j8]l. Instead, their method requires that Q will be composed 
of co-prime numbers whose product is more than n'^ in order 



Notice that their weight scales with the number of specimens, 
implying that more sequencing lanes and robotic logistic are 
required with the growth of n even if d is constant. 

First, we found that Eppstein's method is not applicable 
to the biological and technical constraints in the genotyping 
setting of w ^ 8 and r^ax ^ 1000 (labeled in the table with 
f ). Second, the differences between the number of barcodes, 
Tmax, and the number of queries in their method are no more 
than ten fold, but their weights are least 2.5 fold greater than 
the weights in the light Chinese design. With the estimated cost 
ratio between barcode to sequencing lane to be around 1 : 100, 
the light Chinese design is more cost effective. Finally, there 
is no intrinsic mechanism in Eppstein's method to specify the 
weight, and to limit it below a threshold. 
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IV. RECONSTRUCTION ALGORITHMS 

A. Bayesian Decoding Using Belief Propagation 

Now, we will turn to address the other part of the minimal 
genotyping problem, which is how to reconstruct G given Y 
and In general, this is an ill-posed inverse problem, but the 
sparsity of G due to the biological constraints (e.g the diploidy 
of the genome and the absence of affected individuals in the 
screen) and the low abundance of rare risk alleles permits such 
decoding. The MAP decoding of the genotyping problem is 
given by: 

Gmap = argmaxPr(a;i, . . . ,a;„ I Y) (15) 

For simplicity, we assume that we do not have any prior 
knowledge on the specimens, beside the expected frequency 
of the genotypes in the screen. Notice that kinship information 
between the specimens and familial history regarding genetic 
diseases are known is some cases and may enhance the decod- 
ing results, however, they will remain outside the scope of this 
manuscript. 77 is a probability vector with length of s{s + 1)/2 
that denotes the expected prevalence of each genotype. For 
instance, for AF508 screen ry = (29/30, 1/30, 0) for normal, 
carriers, and affected, correspondingly. Let Xi be an instance 
of row vector in G, and c{xi) be a binary vector with length 
as T] that maps the allelic configuration of Xi to an entry in a 
list of genotypes. For instance, if there are two alleles in the 
population, Xi is either (2,0), (1,1), or (0,2), and c{xi) is 
(1,0,0), (0,1,0), or (0,0,1), correspondingly. We denote the 
prior probability of Xi by: 

s(s+l)/2 

The prior probability for a certain graph configuration, 
(xi, . . .,Xn), is: 

n 

X{v{xi) (17) 

1=1 

The data is also a subject to factorization, since the result 
of a particular query is solely determined by the specimens in 
the pool: 

t 

Pr(Y I XI, ... , Xn) = n I ^^-) (1^) 

a=l 

we used XQa to denote a configuration of the subset of 
specimens in the a query, and Y^ denotes the a*'' row vector in 
Y. The probability distribution Pr(Ya | xoa) is given by the 
compositional channel model in Eq. (|6| and since we assume 
that P and A are constant for all the queries, we will use the 
following shorthand to denote this probability distribution: 

^a{xaa) = fpixaaA) (19) 
From Eq. ( [T5p9l ), we get: 

t n 

Pr(G) cx II ^aixoa) n ^(^') (20) 

a—1 i—1 

The factorization above is captured by factor graph with 
two types of factor nodes, ip nodes and ^' nodes. The ip nodes 




Fig. 4. Example of Factor Graph for Genotyping Reconstruction 



are uniquely connected to each variable nodes, whereas the 
nodes are connected to the variables according to the query 
design in so each variable node is connected to w different 

nodes. An example of a factor graph with 12 specimens, 
and Q = {3, 4} is given in Fig. |4] 

Belief propagation (sum-product algorithm) [40], [41] is 
a graphical inference technique that is based on exchanging 
messages (beliefs) between factor nodes and variable nodes 
that tune the marginals of the variable nodes to the observed 
data. When a factor graph is a tree the obtained marginals 
are exact; however, a factor graph that is built according to 
any reasonable query design will always contain many loops 
(easily proved by the pigeonhole principle), implying that 
finding Gmap is NP-hard [42]. Surprisingly, it has been found 
that belief propagation can still be used as an approximation 
method for factor graphs with loops. These findings rely on 
the concept that if the local topology of a factor graph is a 
tree-like, the algorithm can still converge with high probability 
[43], [44]. This approach has been successfully used in a broad 
spectrum of NP-hard problems including decoding LDPC 
codes [40], finding assignments in random k-SAT problems 
[45] and even solving Sudoku puzzles [46]. Recently, Mezard 
and colleagues studied the decoding performance of belief 
propagation in the prototypical problem of group testing [47]. 
One advantage of their setting is the presence of 'sure zeros' - 
variables nodes that are connected to at least one 'inactive' test 
node. Since the tests are faultless in the prototypical problem, 
those variables are immediately decoded as 'inactive', and 
are stripped off from the factor graph, which reduces the 
complexity of problem handed to the belief propagation. Un- 
fortunately, the query results from next generation sequencers 
are not reliable, and the absence of an allele node from a query 
may stem from insufficient sequencing coverage (small f3) and 
sequencing errors. Furthermore, the total number of sure zeros 
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can be very small as confounding errors may falsely indicate 
the presence of an allele in a query. From these two reasons, 
stripping has little applicability in our setting. On the other 
hand. Baron and colleagues [48] investigated the performance 
of belief propagation for the recovery of compressed signals 
with a linear channel model and additive white Gaussian 
noise (AWGN). Our approach is reminiscent of their method, 
and employs belief propagation on the full graph using some 
essential shortcuts. 

The marginal probability of Xi is given by the Markov 
property of the factor graph: 



(21) 



a=l 



The approximation made by belief propagation in loopy graphs 
is that the beliefs of the variables in the subset da\xi regarding 
Xi are independent. Since Xmax = 1 in light-weight designs, 
the resulted factor graph does not have any short cycles of girth 
4, implying that the beliefs does not strongly correlated, and 
that the assumption is approximately fullfilled. The algorithm 
defines ^a^xi{xi) as: 

(Xi) = ^ ^aixoa) Yl t^x,^a{Xj) (22) 
{x^da\xi} Xj^da\xi 



and 



Hx,~.a{Xj) = ip{Xj) Yl ^J-u^x.iXj) 



(23) 



were u £ dxj denotes the subset of queries with Xj. Eq. (22i 
describes message from a factor node to a variable node, and 



Eq. (23 I describes message from a variable node to a factor 



node. By iterating between the messages the marginals of the 
variable nodes are gradually obtained, and in case of successful 
decoding the algorithm reaches to a stable point, and reports 

G*: 

G* = argmaxPr(a;i) (24) 

Xi 

This approach encouters a major obstacle - calculating 
the factor to node messages requires summing over all pos- 
sible genotype configurations in the pool, which exponen- 
tially grows with the compression level, r„iax, or ^/n. To 
circumvent that, we use Monte-Carlo sampling instead of 
an exact calculation to find the factor to node messages of 
each round. This is based on drawing random configurations 
of xga according to the probability density functions (pdf) 
that are given by the fJ,x-^a{xj) messages and evaluating 
'^aixda)- An additional complication are strong oscillations in 
which the marginal estimation of Xi for the r step is almost 
completely concentrated in one state, but at the r + 1 step, the 
estimation is completely concentrated in another state. One 
of those states is obviously wrong, and a sampling process 
that uses this pdf to evaluate a factor to node message for 
other variable nodes may find only very small values of 
which is prone to numerical stability issues that ended up in 
sending all-zero messages and failure of the algorithm. We 
used message damping to attenuate the oscillations [49]. The 



damping procedure averages the variable to factor messages 
of the m round with the message of the m — 1 round: 

^^(darnped) ) ^ (^™_^(^^.)) (A^™l3a(^.)) ' (25) 

The extent of the damping can by tuned with 7 e [0, 1]. When 
7=1 there are no updated at all, and when 7 = we restore 
the algorithm in Eq. (23 1. Appendix presents a full layout 
of the belief propagation reconstruction algorithm: 

B. Baseline Reconstruction Algorithm 

In order to benchmark the belief propagation decoding 
algorithm above, we introduce an additional algorithm, named 
pattern consistency decoding, which is used in group testing 
to reconstruct the original data from superimposed channel. 
In a carrier screen, the algorithm first creates a new matrix 
that is composed of the columns in Y that correspond to the 
risk alleles, and then it treats the results in the new matrix 
as superimposition according to Eq. (j4|l. We denote the new 
matrix by Y^s. 

This method does not address query errors, and a specimen 
is defined as a carrier only if all its w queries indicate the 
presence of a risk allele: 



where I is an indicator function: 



1 Xij = w 
otherwise 



(26) 



(27) 



Rows of Erisk with positive entries indicate carriers. This 
reconstruction is guaranteed to be correct if do, the maximal 
number of carriers in the screen for one of the risk alleles, 
is lower than d, the disjunction property of (given no 
sequencing errors and sufficient coverage). Since this recon- 
struction works with degraded information compared to belief 
propagation, we will use it to indicate the baseline performance 
expected from belief propagation decoding, and to test whether 
the approximations we employed (loopy messages, Monte- 
Carlo sampling, damping) are valid. 

V. Numerical Results 

To demonstrate the power of our method, we simulated 
several settings where there is one risk allele and one WT 
allele in the population, with n = 1000, /3 = 10^, w = 5, and 
Q — {33,34,35,37,41}, which can be accommodated in a 
single machine batch. Fig. |5] emphasizes the effect of damping 
on the belief propagation convergence rates. In this example, 
the number of carriers in the screen was do = 43, and we ran 
the decoder for 30 iterations. We evaluated different extents 
of damping: 7 e [0.1, . . . , 0.9], and we measured for each 
iteration the averaged absolute difference in the marginal from 
the previous step. We found that with 7 < 0.5, there are strong 
oscillations and the algorithm does not converge, whereas with 
7 ^ 0.5, there are no oscillations, and the algorithm converges 
and correctly decodes the genotype for all the specimens. 

We also tested the performance of the reconstruction al- 
gorithms for increasing number of carriers in the screen. 



SUBMITTED TO TRANSACTION ON INFORMATION THEORY - SPECIAL ISSUE ON MOLECULAR BIOLOGY AND NEUROSCIENCE 



14 



7=0.1 
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7=0.7 



7=0.8 



10 20 

Fig. 5. The Effect of Damping on Oscillations 



7=0.9 




ranging from 5 to 150, with no sequencing errors (Fig. |6|. 
The belief propagation reconstruction outperformed the pattern 
consistency decoder and reconstructed the genotypes with no 
error even when the number of carriers was 40, which is a 
quite high number for severe genetic diseases. The ability 
of the belief propagation to faithfully reconstruct cases with 
do ^ d-disjunction of the query design is not suprising, since 
d-disjunction is a conservative sufficient condition even for a with steps of 0.5% (Fig. [7|. 
superimposed channel. 



allele and the mutant allele is only a single base substitution, 
and sequencing error may cause genotype confounding. To 
recapitulate that, we introduced increasing levels of symmetric 
confounding errors (i.e the two alleles have the same proba- 
bility of being converted from one to the other), and we tested 
the performance of the reconstruction algorithms with /3 = 10^ 
and (3 = 10^, and with error rates in the range of 0% — 4.5% 



We continue to evaluate the performance of the algorithm 
in a biologically-oriented setting - detecting carriers for CF 
mutation, where the carrier rate in some populations 
is about 1.8% [26]. The relatively high rate of the carri- 
ers challenges our scheme with a difficult genetic screening 
problem. Moreover, the sequence difference between the WT 



As expected, the pattern consistency decoder performed 
poorly (data not shown) even for the lowest error rate of 0.5% 
and marked all specimens as carriers. The belief propagation 
algorithm reported the correct genotype for all specimens even 
when the error rate was 1.5% and (3 — 10^. Importantly, the 
decoding mistakes of the belief propagation at higher error 
rates were false positives, and did not affect the sensitivity of 
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Fig. 6. Decodability as a Function of Number of Carriers 
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Fig. 8. Different Weiglits for CF AF508 Screen Detection 



the method. When we increased the number of reads for each 
query to (3 = 10**, the behef propagation decoder reported the 
genotype of all the specimens without any mistake. As we 
mentioned earlier, the expected confounding error rate for this 
mutation up to 1%, implying that the parameters used in the 
simulation are quite conservative. 

We also tested another CF mutation, AF508, which has 
a similar carrier rate in people with European descents as 
T/F1282X, but contains a 3-nucleotide deletion when com- 
pared to the WT allele. This implies that the confounding 
error rates are negligible, as sequencing-induced deletions are 



quite rare. In this example, we evaluated the effect of different 
weights for the query designs, and we used the following sets 
of query windows: {33,34}, {33,34,35}, {33,34,35,37}, 
{33,34,35,37,41}. Fig. [s] depicts the results for the belief 
propagation algorithm and for the pattern consistency decoder 
While the results are quite poor for w = 2, the belief 
propagation decodes correctly all the specimens with w = 4, 
which would requires the synthesis of only 37 barcodes, and 
a total of 139 queries. 

VI. CONCLUSION 

In this paper, we presented a compressed genotyping frame- 
work that harnesses next generation sequencers for large scale 
genotyping screens of severe genetic diseases. We formulated 
the problem as reconstructing a spares bipartite multigraph 
from information that was obtained over a compositional 
channel. In addition to the traditional objective of minimizing 
the number of queries, we introduced another objective of 
reducing the weight of the design, and we propose a new class 
of designs called light-weight designs in which the weight 
does not depend on n, and only grows linearly with d. For 
the genotyping reconstruction part, we presented a Baysian 
framework that is based on loopy belief propagation, and 
we evaluated its performance by simulating different types of 
carrier tests, including prevalent mutations in Cystic Fibrosis. 

Further investigation is needed to expand the framework 
to include prior biological data such as familial information 
and other predispositions, and to include more types of errors 
beyond those introduced by sequencing, such as biased PCR 
amplification, query failures, and sample contamination. In 
addition it will be interesting to develop a more comprehensive 
treatment for the compositional channel, and to find a less con- 
servative sufficient condition for faithful signal reconstruction. 

Appendix I 
The product maximization algorithm 

The product maximization problem is defined as follows. 
Given parameters k, ki and w, find the set Q of size w whose 
elements are all in the range k < x < ki and such that for no 
pair x,y G Q has lcm{x,y) < k^. For product maximization, 
typical values in practice have k in the range [100,300), w 
in the range [2. 8] and ki fixed at 384. The reason for this 
number is the number of wells in a microtiter plate, which is 
compatible with liquid handling robots. The empirical results 
below relate to this entire range, for all of which we have 
optimal solutions discovered by exhaustive searching. 

The product maximization problem has ties to the sum 
minimization problem in both bound-calculation and solving 
algorithms. First, note that in this problem we cannot consider 
"asymptotic" behavior when w, k and ki are large without 
specifying how the ratio — is constrained. 

If K is constant and ki rises, the asymptotic solution will be 
the set {ki, ki — 1, ki — 2, . . . , ki + 1 — w}. This set clearly 
has the maximum possible product, while at the same time 
satisfying the condition on the Icm because no two elements in 
the solution can have a mutual factor greater than w. This value 
will be the optimum as soon as (ki + 1 — w)(ki+2— w) > wk^ 
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(and possibly even before), so -Jw should be taken as an upper 
bound for ^ to form a non-trivial case. 

For any specific ratio the condition lcm{x,y) > for 

2 

X and y values close to ki is equivalent to gcd{x,y) < 
This allows us to reformulate the question as that of finding 
the set Q with w elements, all less than or equal to k, s.t. the 



13 



all 

k2 J • 



gcd of any pair is lower than or equal to p = 

For the product maximization problem, we redefine the 
discrepancy to be 5 = Ki + 1 — min(Q). In order to compute 
the asymptotic bound for this discrepancy, let us first define 
pseudo-primes. Let the set of fc-pseudo-primes, Pk, be defined 
as the set s.t. i e Pk ■^=^ i > k and -^3j < i,j £ Pk s.t. 
i is divisible by j. The set of 1-pseudo-primes coincides with 
the set of primes. 

One interesting property of fc-pseudo-primes is that they 
coincide with the set of primes for any element larger than 
fc^. To prove this, first note that if i is a prime and i > k 
then i by definition belongs to Pk. Second, note that if i is 
composite and i > k^ then i has at least one divisor larger 
than k. In particular, it must have a smallest divisor larger 
than k, and this divisor cannot have any divisors larger than 
k, meaning that it must belong to Pk. Consequently, i ^ Pk. 

Both the reasoning yielding the upper bound and the rea- 
soning yielding the lower bound for the sum minimization 
problem utilize estimates for the density of numbers not 
divisible by a prime smaller than some d. In order to fit this 
to the product maximization problem, where a gcd of p is 
allowed, we must revise these to estimates for the density 
of numbers not divisible by a p-pseudo-prime smaller than 
5. Because the A;-pseudo-primes and the primes coincide 
beginning with fc^, this density is the same up to an easy- 
to-calculate multiplicative constant jk- 

Knowing this, both upper and lower bound calculations 
can be applied to show that the asymptotic discrepancy of 
the optimal solution is on the order of jkwln{w). This 
discrepancy can be used, as before, to predict an approximate 
optimal product. However, the bound on the product is much 
less informative than the bound on the sum: the product can 
be bounded from above by and from below by (ki — (5)™, 
both converging to a ratio of 1:1 at ki rises. 

The revised greedy algorithm for this problem is given 
exphcitly below. 
1: Let Q be the set of the w largest primes < ki. 
repeat 

5 -t— Ki + 1 — min((5) 

Q <— the «; largest numbers < ki that have no factors 
smaller than 5 
until 5 = Ki + 1 — mm{Q) 
repeat 

S ^ Ki + 1 — min(Q) 

n{x) =' the number of distinct primes smaller than 6 
in the factorization of x. 

Sort the numbers min((5), . . . , ki by increasing n{x) 
[major key] and decreasing value [minor key], 
for all i in the sorted list do 

if Vg e Q, lcm{q, i) > and i > min(Q) then 
replace min((5) by i in Q. 



else if There is exactly one q G Q s.t. lcm{q, i) < k^, 
and i > q then 
14: replace q with i in Q. 

15: end if 
16: end for 

17: until S = Ki + 1 — min((5) 
18: output Q. 

Note that the greedy algorithm tries to lower the discrepancy 
of the solution even when there is no proof that a smaller 
discrepancy will yield an improved solution set. In the sum 
minimization problem, any change of A in any of the variables 
yields a change of A in the solution, so there is little reason to 
favor reducing the largest element of Q (and thereby reducing 
the discrepancy) over reducing any other element of Q. In 
product maximization, however, a change of A to mm{Q) 
(and hence to the discrepancy) corresponds to a larger change 
to the product than a change of A to any other member of 
Q. This makes the greedy algorithm even more suited for the 
product maximization problem than for simi minimization. 

Indeed, when examining the results of the greedy algorithm 
on At = 384, with w e [2,8] and k € [100,300) we see 
that the greedy algorithm produces the correct result in all 
cases w € [2,3,4]. In w E [6,7,8] the algorithm produces 
the optimal result in all but 2,3 and 3 cases, respectively. The 
only w for which a large number of sub-optimal results was 
recorded is w = 5 where the number of sub-optimal results 
was 49. Note, however, that in product maximization there is 
a much larger tendency for "streaking". The 49 sub-optimal 
results all belong to a single streak, where the optimal answer 
is {379, 380, 381, 382, 383} and the answer returned from the 
greedy algorithm is {377,379,382,383,384}. The difference 
in the two products is approximately 0.008%. 

In terms of streaks, the optimal answer was returned in 
all but one streak in w e [5, 6] and in all but two streaks 
in w <E [7,8]. In terms of the number of iterations required, 
the only extra iterations that were needed in the execution of 
the algorithm beyond the minimal required was a single extra 
iteration through the first "repeat" loop when w was 3. In 
all other cases, no extra iterations were used, demonstrating 
that this algorithm is in practice faster than is predicted by its 
(already low-degree polynomial) time complexity. 

Appendix II 
Full Layout of Belief Propagation 

Reconstruction 

1) Inputs: Query design sequencing results Y, prior 
expectations about the genotypes prevalence r], damping 
parameter 7, number of iterations rrimaxt and number of 
Monte Carlo rounds z. 

2) Preprocessing: (a) find /3 - enumerate the number of 
reads in the query, (b) learn the genotype error pattern 
A - the sequencing errors rates are estimated using 
spiked-in controls [36], and converted to genotype error 
according to the sequence of the different alleles, (c) 
calculate ip according to 77. 

3) Initialization InitiaUze the iteration counter m. Initiahze 
lJLxi^a{xi) to priors in ip. 
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4) 



Send messages from factors to vari- 
ables: 

for each factor a in { 1 , . . . , i} do 
for each variable Xi in query a do 
for each state of variable a;^ in {1, . . 



1 
2 
3 
4: 
5: 
6: 



do 



Set *o 
for {1,. 







5) 



7: 

O. 


O 

9 




10 




11 




12 




13 




14 


e 


Send 


tors: 



, z} Monte-Carlo round do 
r <— random configuration of da\x ac- 
cording to pdfs in i^^^.^a 
^'o ^ *o + *a(7-, State of x,) 
end for 

(state of Xi) <— '^a/m 

end for 

Normalize ^a~*xi{xi) 
Send message Ha~^xi{xi) 
end for 
id for 

messages from variables to fac- 



1: for each variable Xi in {I, . . . ,n} do 
2: for each factor a connected to Xi do 
3: Set n'^.^^{xi) to all ones vector 
4: for each possible state of variable Xi in 

{l,...,h|}do 
5: for each factor j connected to Xi except a 

do 

6: state of Xi) = Ai™^a( state of Xi) 

Hj^xi ( state of Xi) 
end for 
end for 

Include prior by Mj*^a(^j) *~ 

^i'^^^^{xi)ip{xi) 

10: Damp M"-^a(2;i) 

11: Normalize ^™_^^{xi) 

12: Send message fi^.^^{xi) 

13: end for 

14: end for 

15: m ^ m + 1 

Go back to step [4] if m < rrimax- 

6) Marginalize: For every variable node compute the 
marginal according to Eq. (21 1, and find the state of 
the variable with the highest probability. 

7) Report: Report the highest state of each variable and 
construct G. 
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